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Abstract 



Learning latent "expression themes" that best express complex patterns in a sample is a central prob- 
lem in data mining and scientific research. For example, in computational biology we seek a set of 
salient gene expression themes that explain a biological process, extracting them from a large pool of 
gene expression profiles. In this paper, we introduce probabilistic models to learn such latent themes in 
an unsupervised fashion. Our models capture "contagion", i.e., dependence among multiple occurrences 
of the same feature, using a hierarchical Bayesian scheme. Contagion is a convenient analytical formal- 
ism to characterize semantic themes underlying observed feature patterns, such as "biological context". 
We present model variants tailored to different properties of biological data, and we outline a general 
variational inference scheme for approximate posterior inference. We validate our methods on both sim- 
ulated data and realistic high-throughput gene expression profiles via SAGE. Our results show improved 
predictions of gene functions over existing methods based on stronger independence assumptions, and 
demonstrate feasibility of a promising hierarchical Bayesian formalism for soft clustering and latent as- 
pects analysis. 

Keywords: Contagion processes, Poisson distribution, negative-binomial distribution, hierarchical Bayesian 
models, mixed membership, variability allocation, approximate posterior inference, mean-field approxi- 
mation, serial analysis of gene expression (SAGE) data. 



1 Introduction 



As a consequence of the information glut society faces, a fundamental issue in data mining and scientific 
pattern discovery is that of finding a useful representation of complex systems, which is amenable to math- 
ematical and statistical learning analyses. A useful representation would summarize the plethora of feature 
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patterns observed in a sample with a small set of typical "feature expression themes" that replicate the 
variability of the observations. For example, in computational biology, we may seek a set of salient gene 
expression patterns, not directly observable, that explain a biological process from a large pool of observed 
gene expression profiles; in text analysis, we may seek the set of latent topics, i.e., typical word distributions, 
that best explain a collection of documents. The set of latent themes for feature can then be used for further 
analyses about the behavior of the whole system. 

The task of identifying latent themes is essentially a clustering problem, where we have little or no 
information about the properties of the themes/clusters we seek. For any given number of latent themes^, 
we seek to allocate observed feature expressions to possible underlying themes; or, in other words, we need 
to cluster objects (e.g., genes or documents) that are similar in terms of their observed feature expression 
profiles into coherent themes. Among existing approaches to this task, Cai et al. (2004) introduce a variant 
of K-means algorithm that minimizes non-standard scoring functions, based on the Chi-square statistic, and 
the Poisson distribution of feature expression rates — see Section 6 for more details. This approach, however, 
constrains all feature expression levels measured on the same object to follow the expression profile typical 
of a single theme. Pritchard et al. (2000) relax this assumption, and posit that feature expression levels 
measured on the same object (in their case, the occurrences of a defined set of genetic polymorphisms in 
an individual) are mixtures of the expression profiles typical of several themes, i.e., population- specific 
frequency of each polymorphism. This model was independently re-discovered in the machile learning 
community Minka and Lafferty (2002); Blei et al. (2003) with the goal of learning topics from a collection 
of documents — we refer to this popular model as the "independence model" in the remaining of this paper. 
Recently, there has been a flurry of research on soft clustering in the machine learning and computational 
biology communities Cohn and Hofmann (2001); Rosenberg et al. (2002); Xing et al. (2003); Griffiths and 
Steyvers (2004); Buntine and Jakulin (2004). The proposed models, however, are often unrealistic and fall 
short of replicating the true marginal variability profiles of the observations. In particular, as discussed in 
Section 2.1 and Section 2.2, existing models appear to be unsuitable for the biological application (i.e., 
SAGE analysis) we concern in this paper. 

In this paper, we introduce a hierarchical Bayesian formalism to address these problems. Briefly, our 

^Such "themes" usually correspond to parametric formulations of the feature generation process Airoldi et al. (2006b). 
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models 

1. learn latent feature expression themes from data in an unsupervised fashion, 

2. enable domain- specific information to be incorporated in the form of priors on the hyper-parameters 
at the top of the hierarchy; 

3. assume that each feature may be instantiated under various themes to different degrees (i.e., feature 
emission probabilities are mixtures). 

Furthermore, we introduce the notion of contagion, which refers to the existence of dependences among 
subsequent occurrences of the same feature, when modeling objects on the basis of a generative process. 
Contagion is a convenient analytical formalism to capture richer variability profiles than current models al- 
low for, and it characterizes plausible semantic themes, such as "biological context", underlying the observed 
feature patterns. In this paper, we present an analysis based on the contagion process of gene expression 
profiles measured via the SAGE technology. We show a new approach to summarize samples of gene ex- 
pression data into latent gene expression themes, and compare our approach with extant algorithms. It is 
worth pointing out that the models we present here apply to a wider array of problems, e.g., the summa- 
rization of a collection of scientific publications into latent word frequency profiles, typically referred to as 
topics in the machine learning community. 

Here is the plan for the rest of paper. We introduce the biological problem, and motivate the contagion 
process in Section 2. Then we present several variants of our model tailored to different properties of gene 
expression data in Section 3. A general variational inference scheme for approximate posterior inference is 
outlined in Section 4. And we validate our methods on both simulated data and realistic high-throughput 
mouse retinal gene expression profiles via SAGE in Section 5. 



2 The Biological Problem 

Serial analysis of gene expression (SAGE) Vesculescu et al. (1995) is a technology that quantitatively mea- 
sure the copy numbers of mRNA transcripts, simultaneously for a large number of gene in a biological 
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sample, such as a cell population or a tissue. 

A SAGE experiment begins by sampling a total of B transcripts at random from a biological sample un- 
der some specific condition (e.g., a cell cycle stage), and then use N gene-specific tags to probe the existence 
of possible genes in each of the B transcripts. Let X5 = (-^51, -^52, • • • , ^hN^ 1 -^6n ^ {0, 1}, Yin ^bn = 1 
be a unit-base indicator vector recording the probing results for transcript b (i.e., = 1 indicates that gene 
n is present on transcript 6). The number of mRNA copies of a gene n, denoted by Yn, and the vector of 
copy counts for all genes (i.e., an expression profile), y = (Yi, I2, • • • , ^a^)^, can then be simply expressed 
as: 

B B 

y, = 5^X5,, Y^J^x,. (1) 

b=i b=i 

Note that 1^'s are each binomial distributed, controlled by gene-specific parameters pi:N each captures the 
probability of occurrence of gene on a random transcript, and a common sample size parameter B. When 
multiple cellular conditions are of interest, for example, stage sequences in a cell cycle, we can index an 
expression profile with its sample condition, e.g., Y^, for measurements obtained at time t. 

The main random quantities of interest are: the observed gene expression levels 1^'s, for the n-th gene 
at the t-th epoch; the observed gene expression profiles Y^'-^'s, for the n-th gene; and the latent gene 
expression themes, e.g., p^-^ or A^-^, for the k-th theme, as defined in Pritchard et al. (2000) and in the 
basic model of Sections 3.1, respectively. Technically, the latent gene expression themes are multivariate 
emission probabilities for the gene expression levels, conditionally on the "active" theme. The notation we 
adopt puts forward the set of parameters underlying a specific distribution, e.g., A^'^ is a vector of Poisson 
rates, which control the expression levels of those genes that are expressed according to the k-th theme. For 
example, whenever the n-th gene is expressed according to the k-th theme we have 

Y^--^r^[PoisiXl),...,PoisiXl)]. 
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2.1 Analytical Justifications of Contagion 



In the biological problem above, we often face situations where occurrences of the same gene under single 
and multiple conditions are not independent of one another, because they are sampled from a cell population 
or a tissue that provides a specific "biological context". Contagion processes provide a useful analytical 
mechanism to capture this notion. The two generative models we propose for analyzing temporal gene 
expression data {Y^'^}^^^, which instantiate the contagion process, are based on the a Poisson and a 
negative-binomial distribution of integer counts^, at different levels. 

Our choices were motivated by few main considerations. The Poisson distribution offers a computational 
advantage over the binomial distribution. We can reasonably assume that the gene-specific probabilities of 
occurrence pi-N are very small, given that there is a large amount of transcripts present in a specific bi- 
ological sample. Consequently, it is reasonable to approximate the binomial probabilities with Poisson 
probabilities, as well as computationally efficient. The sampling algorithms underlying both the Poisson 
and negative-binomial distributions lead to marginal and conditional^ distributions for the gene expression 
levels with desirable properties. Assuming Poisson or negative-binomial conditional emission probabili- 
ties relaxes the assumption that, in the (sequential) sampling process described in Section 2, subsequent 
observed instances of the same gene tag are independent. In fact, such independence leads to binomial 
conditional emission probabilities Pritchard et al. (2000). The dependence among different observations of 
the same gene tag at the conditional level is a one characteristic of the notion of contagion we introduce. 
Another characteristic of our notion of contagion is found at the marginal level. Recall that ideally we 
identify themes that can be interpreted as "biological or functional contexts". Following the intuition that 
each gene may be expressed under multiple biological contexts to a different degree, we model the prob- 
ability of observed gene expression levels, y^, as a mixture of conditional emission probabilities, where 
the gene-specific mixture weights given by the mixed membership vectors. On, are constant over time (or 
across experimental conditions). The mixing leads to marginal distributions that are more skewed than the 
corresponding conditional distributions. This is the "contagion effect" more popular in the literature"^ Si- 

^For a review of various parameterizations, and the corresponding estimators we refer to Airoldi et al. (2005), Johnson et al. 
(1992) and Kadane et al. (2006). 

^Conditionally on the "active" theme. 

"^Although this second characteristic of contagion processes is more common in the literature, there is an subtle point to notice in 
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Table 1: Methods-of-Mometns estimates of negative-binomial parameters for gene expression levels in 
mouse retinal cells of at 10 different stages of development Cai et al. (2004). A discussion of the estimators 
is given in Airoldi et al. (2005). 



Epoch 


mean 


var. 


Vvar. 

lllCd.il 


a 




1 


30.1172 


150.8648 


2.2381 


11.1733 ±0.3655 


4.3000 ± 0.2155 


2 


26.5542 


163.8892 


2.4843 


9.8514 ± 0.4075 


6.1021 ±0.3304 


3 


28.1718 


155.4820 


2.3493 


10.4516 ±0.2936 


2.9376 ±0.1448 


4 


31.5446 


204.2503 


2.5446 


11.7029 ±0.3267 


3.2591 ±0.1588 


5 


26.0307 


94.4013 


1.9043 


9.6572 ± 0.4154 


6.4720 ± 0.3562 


6 


26.6489 


82.0171 


1.7543 


9.8866 ±0.2118 


1.5748 ± 0.0795 


7 


27.3122 


82.0405 


1.7331 


10.1327 ±0.2491 


2.1565 ±0.1066 


8 


25.1990 


53.6102 


1.4586 


9.3487 ± 0.2637 


2.6407 ±0.1319 


9 


27.1513 


89.7169 


1.8178 


10.0730 ± 0.4472 


7.2014 ± 0.4008 


10 


20.8160 


81.2509 


1.9757 


7.7226 ± 0.5975 


16.8959 ± 1.3156 



mon (1955). For example, in the case where the conditional probabilities are Poisson, their mixing would 
increase the variability of the expression levels. A formal model of contagion that encodes this intuition is 
the negative-binomial model, which arises as an infinite Gamma mixture of Poisson distributions. These 
arguments support our distributional choices. From a data analysis standpoint, the marginal distributions 
that encode contagion fit well the observed expression levels. 

To summarize, contagion processes are the result of latent regularities present in structured data, such 
as the SAGE profiles we study. The inherent topical structure of the data, i.e., the fact that genes may be 
expressed under several latent themes, leads to hierarchical mixing of emission probabilities, and, ultimately, 
to the over-dispersion of gene expression levels. 



2.2 Empirical Evidence 



Our motivating example is the set of mouse retinal SAGE libraries analyzed in Cai et al. (2004). The raw 
mouse retinal data consists of 10 SAGE libraries (38,818 unique genes that appeared more than twice in 

latent aspect models that feature independence of subsequent observed instances of the same gene tag Pritchard et al. (2000); Minka 
and Lafferty (2002); Blei et al. (2003). Specifically, if we model themes as multinomial distributions, then Dirichlet distributed 
mixing weights will not alter the mean-to-variance ratio of the marginal distribution, which is still multinomial. Rather, the main 
effect of mixing is an increased variability. 
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the sample) from developing retina taken at 2-day intervals, ranging from embryonic day to postnatal day, 
and adult, for total of 10 epochs Blackshaw et al. (2004). Of the 38,818 genes, 1,467 that appeared more 
than 20 times in at least one of the 10 libraries were selected. These 1,467 genes were purported as the 
potentially most biologically relevant because of their high frequency of occurrence. The data analyzed 
in this paper consists of the pool of observed expression profiles (y^, y^, . . . , Y^^) for the 1,467 selected 
genes, measured at ten epochs during the development period. 

We tested the distributional intuitions we discussed in §2.1 on the SAGE data. In Table 1 we report 
summary statistics and estimates for the negative-binomial parameters described in Airoldi et al. (2005). 
Our exploratory data analysis confirms the expected over-dispersion of the gene counts, suggested by the 
"mixture of Poisson distributions" hypothesis. Moreover, the estimates of the extra-Poissonness parameter 5 
are all positive^ with very high probability, as indicated by a quick inspection of the corresponding standard 
deviations. Lastly, note that the log transformation Q = log(l + 5) is effective in reducing the heavy tail of 
the distribution of 5. Thus, we prefer to work on the Q scale, where positing a simple prior is sensible. 

In conclusion, the SAGE data we analyzed are not under-dispersed (i.e., variance < mean), as is im- 
plied by treating the random variables {X^-^} as Bernoulli processes Pritchard et al. (2000); Rosenberg 
et al. (2002). Such an assumption leads to clustering models based on Multinomial latent profiles and bino- 
mial emission probabilities for feature counts Blei et al. (2003); Griffiths and Steyvers (2004); Buntine and 
Jakulin (2004), which are not often warranted. 

3 Contagion Processes 

In this section we introduce two hierarchical Bayesian generative processes for clustering SAGE data into 

expression profiles in an unsupervised fashion. These models capture "biological context" through the 

notion of contagion. Recall that we observe sequences of gene tag counts (y^, y^, . . . , Y^) that measure 

the expression level (i.e., the abundance) of the n-th gene in the target cell or tissue across epochs 1 though 

T. In the models below we assume there is a fixed number, K, of latent expression profiles, and that genes 

are expressed under different profiles to different degrees. 

^Recall that as 5 ^ the negative-binomial density degenerates into a Poisson density. 
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Figure 1: Graphical representation of the generative processes of contagion based on the Poisson (top left) 
and negative-binomial sampling schemes. The representation for the processes of contagion based on the 
Poisson sampling scheme for the non-basic models are easily obtained, by removing the part of the graphical 
models depending on 6. In fact, recall that S is the extra-Poissonness parameter, and as 5 ^ the negative- 
binomial density converges to the corresponding Poisson limit. We refer to Johnson et al. (1992) for more 
details. 



3.1 Poisson Generative Process 

The first generative process we propose is based on the Dirichlet and Poisson distributions. There are four 
flavors of the Dirichlet-Poisson generative process: basic (bDiP), normalized (nDiP), conditional (cDiP), 
and smoothed (sDiP). In the "basic" model we explicitly posit the "mixed-membership" of genes to latent 
profiles by assigning to each gene a Dirichlet vector of probabilities, 0^ In order to generate the observed 
expression levels Y^'-^ of the n-th gene, assuming K latent expression profiles, we proceed as follows. 

1. Sample 

2. For each epoch t = 1, . . . , T 

2.1. Sample ^ Multinomial {On^ 1) 

2.2. Sample yl^ Poisson {MkWnj^ = !)• 
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The genes are the sampling units in SAGE experiments, and the total volume of their expressions often 
vary over time. We want to recover "calibrated" expression profiles that do not depend on the total expres- 
sion volume. Therefore, we posit the "normalized" model in order to rescale the samples (i.e., the genes) 
according to their different sizes (the total expression volumes), and ultimately improve the parameter es- 
timates. In the basic model, the matrix A = {\tk} contains the rates that govern the expression level of 
genes at T different epochs for each of the K different latent profiles. In the normalized model, the expected 
expression level of a gene Tn, at time t for profile fc, is 

Mk = • l^tk, (2) 

where Un is scalar and observed, and denotes the total expression level of gene as a multiple of a fixed total 
expression level /? used as a reference expression level. This new parameter (3 may a fixed pre-determined 
value, estimated via, e.g., empirical Bayes Carlin and Louis (2005), or given a distribution as part of a full 
Bayesian analysis Airoldi et al. (2006a). 

In both the basic and the normalized models above, the rows of the parameter matrices A and /x control 
the rates at which genes are expressed. In particular, \tk and iitk encode the expected expression level of 
genes at time t for profile k. Since profiles are by definition not observable, none of these parameters can be 
estimated directly from the data. 

We reparameterize the rows of the normalized rate matrix /x with the sum/ratio parameterization, i.e., for 
every epoch t we transform 

(Mu, Mt2, . . . , I^ik) — ^ (c^t, Pti, Pt2, . . . , Ptx), (3) 

where the sum parameter at := J2^=il^tk^ the ratio parameters ptk and the constraint that 

J2^=i Ptk = 1 makes the ratio parameter ptK redundant for each t. 

This reparameterization leads to the "conditional" model, where the sum parameters (ai, (J2, . . . , cft) 
are directly estimable from the data, and we can carry out inference conditionally on them. This is possible 
since the parameters at encode the total normalized expression level at time t, sum of the expression levels 
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over the profiles, which is an observable quantity as it does not depend on the latent profiles. Conditioning 
on the MLEs for the total expression parameters, at, leads to a new allocation problem where we need to 
infer the differential expression levels of genes under the K profiles. In other words, we need to "split" the 
total expression level at each time t, given by a direct estimate of cr^, among the latent profiles. 

Last, we introduce the "smoothed" model, where we posit a posit a prior for the differential expression 
rate parameters to smooth the estimates. It is possible to posit a prior distribution on the total expression rate 
parameters as well, but we choose not to. A brief analysis of the observed total rates suggests it is appropriate 
to apply a logarithmic transformation on them to stabilize the variability, and one can introduce a Gaussian 
prior on the transformed rates; however, an inspection of the total rates at over time (see Table 1) suggests 
that some other phenomenon is possibly going on, which leads to a decreasing occurrence of the genes in 
the SAGE libraries. Therefore we choose to use the observed total rates to inform our inferences directly, as 
in the conditional model^. However, in the smoothed model we sample the differential expression levels 

Pt . ^ Dirichlet K 

for each epoch t = 1, 2, . . . , T. See Figure 1. 

In conclusion, the Dirichlet-Poisson generative process possesses a few advantages: (1) this sampling 
scheme encodes contagion in the sense that multiple occurrences of the same gene tag at the same epoch 
depend on one another, under a specific latent expression theme; (2) this sampling scheme arises naturally in 
the biological experiments we are interested in as we discussed in §2.1; (3) computing Poisson probabilities 
is computationally more efficient than computing binomial probabilities, since we do not have to evaluate 
binomial coefficients. 

^Smoothing the overall rates {atk} would impose a model on data that we would not be able to justify, since we do not have an 
intuition of why the overall rates are declining. This would cast some doubts on the interpretability of the inferences such a model 
would lead to. 
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The generative process of contagion based on the negative-binomial sampling scheme is similar in spirit 
to the previous one based on the Poisson sampling scheme. A formal treatment of models along this line, 
however, would involve tedious parameterization details that is beyond the scope of this paper. Intuitively, 
the negative-binomial distribution has two parameters that control mean and variance, and the variance is 
greater than the mean — this is a useful feature to capture the observed over-dispersion of gene expression 
levels. Its density can be written as a Poisson density with an extra parameter 5 that controls the amount of 
extra-Poisson variability. Such a version of the density is our starting point, 



rates and {5tk} are profile- specific extra-Poissonness parameters. We then introduce the conditional model, 
where we apply the sum/ratio parameterization of equation 3 to both sets of parameters to obtain mappings 



NB ( y\ I UnlJ^t.^nSt) 



where Kt '= ^ for convenience of notation. In normalized model, {/J^tk} are the profile- specific Poisson 




. . . , Ptx) 



(4) 



{it, mil ^2 



(5) 



Finally, in the smoothed model we sample the differential extra-Poissonness parameters 



r]t . ^ Dirichlet K (7) 



for each epoch t = 1,2,... 



T. See Figure 1. 
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4 Approximate Posterior Inference 

Inference in these models is a challenging task. In fact, in order to obtain the posterior for the latent variables, 

( 'y^'-TX^ I So.^'-T\N S\^-T\K \_ P ( i^n, Zn^}n=l^ l^n^ln^l | {^k^}k=l ) /z^x 

P [ X^n^^n Jn=l \ Wn Jn^l'^' l^/c Jk=l )— / r i-T\N \ 7\VT\K \ ' 

P [ XVn in=l I ^' X^k Xk=l ) 

we need to compute the likelihood of the data, which is given by an integral with no closed form solution. 
The quantity at the denominator on the right hand side of Equation 6, for example, is the likelihood of the 
data we need in order to solve the Bayes problem in the basic model of Section 3.1. 

In order to perform inference in these models, we develop a mean-field approximation to the posterior 
that involves the substitution of an integrable lower bound for the likelihood. The mean-field approximation 
involves positing a simple distribution, q, over the latent variables, which depends upon an extra set of (vari- 
ational) free parameters, {z/^i, (t)n^}n=i' The free parameters are then set to minimize the Kullback-Leibler 
divergence between the true and approximate posteriors. This is equivalent to maximizing a lower bound for 
the likelihood within each E-step, over the free parameters, and then compute pseudo-exprectations for the 
latent variables off the tight lower bound. The overall inference algorithm is a variational EM scheme, which 
employs the mean-field approximation to carry out the E-step, as discussed above, and alternates with a reg- 
ular M-step, where the pseudo maximum likelihood estimates of the model parameters, e.g., (a, {A^-^}^^) 
for the basic model, are revised by further maximizing the lower bound for the likelihood. We iterate these 
two steps till convergence. 

The variational EM scheme just described practically translates into a coordinate ascent algorithm, where 
parameters are naturally organized into batches with similar semantics. The parameter updates correspond- 
ing to the model variants we considered here are summarized in Table 2. 

4.1 A General Bayesian Formalism for Latent Aspects Analysis 

The variational inference scheme we developed is quite general. In fact, the free parameter updates (that are 
used to maximize the lower bound for the likelihood within each E-step) take a generic form applicable for 
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Table 2: In the table below we summarize the parsimonious mean-field approximation for the various mod- 
els. The parsimonious mean-field approximation posits one latent expression profile indicator z for each 
(gene,epoch) pair. Note that T := ^ql^ogO^k]^ ^nd Po, NB, are short for Poisson, and Negative- 
Binomial, respectvely. ** Alternatively use the Method of Moments described in Airoldi et al. (2005) 
pretending to observe pseudo counts {0^^ • y^} as the expression levels of the n-th gene according to the 
k-th latent theme. 



PoissoN Negative-Binomial 



Basic z/^^ = q^/c + ^mk 

tk - En <Pntk 

al WITH NEWTON-RAPHSON 



Norm, z/^^ = Q^/c + Et ^mk 

^ntk T • Po ( I UJnm ) 

n 



al WITH Newton-Raphson 



(i>*ntk O^'^-^B {Vnl '^nm ) 

n ^ntkVn 
mk ~ Y.n't'ntk'^n 

54 = L-BFGS ** 

al WITH Newton-Raphson 



COND. U*^ = ttfc + Et <Pntk 

Ktk o^'^ ■Po{yi\ LUnCrtPtk ) 

Ptk 



nk 



^ (^k + Y.t ^ntk 



y 



I'ntk 
P*tk-- 
4k-- 



<xT-NB{yi 

_ (f^ntkyn 
J2n 4>ntk^nO-t 

= L-BFGS ** 



i^tptk ) 



al WITH Newton-Raphson 



al WITH Newton-Raphson 



all different conditional emission probability function we concern, e.g., Table 2. Furthermore, for a generic 
conditional emission probabilities piVnlf^D for all (n, t, k), with parameter set we obtain the 

generic the free parameter updates 

Ktk -p {yi \ Pk)^ 

where T e^^I^^^^-^l as in Table 2. The updates for z/*^ do not change. 

The generality of the approximate E-step in latent aspects analysis that feature one latent group indica- 
tor, z^, for each gene-epoch pair (n, t) is due the specific hierarchical formulation of our models. Such a 
formulation posits exchangeable measurements on features, e.g., gene expression levels at each epoch. Dif- 
ferent conditional emission probabilities only lead to different estimators for the corresponding parameters, 
{/?^^^}f=i,intheM-step. 
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5 Experiments 

A non-trivial difference in the generative process with respect to the "independence model" in Pritchard 
et al. (2000); Minka and Lafferty (2002); Blei et al. (2003) has far reaching implications for applications. 
For example, models of contagion provide a better fit for data with realistic mean-to-variance marginal 
ratios, such as that in biological applications to SAGE. A better fit helps recovering more precise mixed 
memberships of genes to themes, as well as finding tighter clusters, with respect to the independence model. 

In this section, we support our claims with 3 sets of experiments: (1) In simulative experiments, we 
showed that DiP is better at recovering membership than the independence model when realistic SAGE 
mean/variance ratio holds; (2) in small samples bearing realistic SAGE characteristics, although the re- 
covered clusters differ only slightly, the estimated mixed-membership are sharper using DiP than with the 
independence model (the PoissonL in Cai et al. (2004) does not facilitate estimation); (3) in a real dataset, 
we recovered meaningful gene expression profiles according to an empirical evaluation scheme adopted in 
Cai et al. (2004) (e.g., rhodospin and photoreceptors in same cluster), and obtained a reasonable estimate of 
the total number of salient expression themes. As in many biological clustering task, objective comparison 
of clustering results are difficult, but the fact that DiP gets less fragmented clustering (15 versus 30 clusters 
by PoissonL), and sharper cluster mixed-membership estimates suggests that it is a more reliable theme 
identification model. 

5.1 Simulated Data 

We first validate our models by examining to what extend they can recover the mixed-membership probabil- 
ities {On}, i.e., the soft cluster assignments of each gene, under various simulated conditions. We generated 
the ground truth using our generative processes, and we focused on scenarios where the "mean" expression 
level at the various epochs was lower than its corresponding "variance" — a realistic biological experimen- 
tal scenario. We compare our models, normalized DiP and conditional DiP, with two other methods, the 
independence model Pritchard et al. (2000); Minka and Lafferty (2002); Blei et al. (2003), and the Pois- 
sonL model Cai et al. (2004). Our models yield higher likelihoods of expression profiles in the test set (not 
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shown), and more accurate predictions of the latent theme id of each gene based on their observed expres- 
sion levels. Out of 1000 genes we simulated, for example, nDiP and cDiP achieved 75.95% and 70.32% 
accuracy, respectively, whereas the independence model reached only 63.25%. 



5.2 A 20-gene Synthetic Data Set 

Here we report our analysis of a small dataset used in Cai et al. (2004), which contains the expression 
profiles of 20 genes over 5 temporal epochs. Eighteen of the 20 genes belong to one of 4 clusters (temporal 
themes), and the 2 remaining two are identified as outliers. 




Figure 2: The raw example data in Cai et al. (2004), on the original expression scale (left); on a normalized 
expression scale, by gene, into [0, 1] (center); and on a normalized expression scale, by epoch, using di-T 
(right). 
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The expression profiles are generated from 6 different latent themes, or clusters, which the authors reduce 
to 4 by ignoring the abundance of the gene tags observed on the transcripts sampled at each epoch. In 
particular, there are 3 profiles from theme 1, 4 from theme 2, 6 from theme 3, and 6 from theme 4. The raw 
data is plotted in Figure 2 on various scales. Among the profiles from theme 2, there is 1 with 10 times as 
many gene tags as the others, and similarly for theme 3 — number 7 and number 13 in Figure 3. Note that 
these 2 profiles are "more expressed" but they follow an expression theme similar to the other expression 
profiles in the respective clusters. 

In Figure 3, we display the 4 themes learned by the normalized and conditional DiP models (bottom-left 
panel), versus those learned by PoissonL Cai et al. (2004) and the independence model (top-left panel). A 
rough eyeballing shows that the gene expression themes learned by DiPs and the two competing methods are 
similar. However, a close examination reveals the following. Arguably, we obtain a more compact themes 3, 




DIP Conditional 



Theme 1 Theme 2 Theme 3 Theme 4 

Figure 3: Left: Latent gene expression themes learned by different algorithms. Top: 4 themes (numbered 
1 to 4 from left to right) learned by PoissonL and the independence model. Each theme is represented by 
the expression profiles of all the genes assigned to that theme base on MAP prediction using the estimated 
mix-membership vector On- In this case, PoissonL and the independence model give the same membership 
prediction. Bottom: The 4 themes discovered by normalized DiP and conditional DiP. Note that due to 
overlap of the profile curves, the "occupancy" number of each theme is not apparent here. But in Fig. 3, one 
can see it more clearly. Right: The estimated membership probabilities, {9nk}^ for the independence model 
(top), nDiP (middle), and cDiP (bottom). Each row correspond to a theme, and each column corresponds to 
a gene. The color shades of the cells correspond to values ranging from 1 (black) to (white). The panel 
shows that cDiP yields the sharpest estimates. 
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as revealed by the lower degree of dispersion among genes assigned to this theme; but for theme 2, the genes 
assigned to it by the independence model and PoissonL are slightly more consistent. Overall, the software 
clustering assignment of each gene are compatible across all 4 algorithms, and as shown in Figure 3), but 
the mixed-membership probabilities inferred by the DiPs for each gene are sharper. If we compare the MAP 
assignment of each gene to a single most probable themes, the 19 of the 20 genes are consistent across all 4 
algorithms, and their assignments agree with the true themes label given by the original dataset. The remain 
one, gene no. 10, is intriguing. It has an expression profile, {1^10^} = (4, 10, 16, 14, 6), and is originally 
labeled as from theme 2, {A2 ^} = (10, 30, 30, 60, 10). Apparently profile {1^10^} exhibits great variability 
with respect to its supposedly underlying theme. Using DiP, we infer the label of gene no. 10 to be theme 
3, which has a prototype profile {A3 ^} = (10, 10, 10, 10, 10), and indeed we found much of the variability 
in gene 10 is related to the overall abundance of all genes in different epochs, rather then its intrinsic trend. 
So we feel this assignment is arguable more plausible the the purported theme 2. As shown in Figure 3), the 
independence model inferred a split assigned, about equally probable to theme 2 and 3. 

To summarize, this little example is meant to show the role of realistic model properties in latent alloca- 
tion tasks. The intuition is that if the model cannot express, on average, the salient properties of the data, then 
it may lead to artif actual effects. Specifically, the unexplained variability will need to find a "place-holder", 
and it will typically tend to increase the variability of parameter estimates. 

5.3 Mouse Retinal SAGE Profiles 

Here we go back to the motivating case study we introduced in Section 2.2 — the mouse retinal SAGE 
libraries analyzed in Cai et al. (2004), which contains 38,818 unique genes for total of 10 epochs. We first 
perform model selection via a five-fold cross validation, to estimate the plausible number of latent themes 
that best explain the data. The held-out likelihood peaked at 15 themes for cDiP, and 10 for the independence 
model. Figure 4 shows the prototype gene expression profile for each of the 15 themes due to cDIP. The 
variance of each theme are not shown, because in many cases they are so small that the variance-bars are 
masked by the "dot" symbol in our plots. Notably, we found that the magnitude of the held-out likelihood 
for cDiP is about ten times larger (on the log scale) than that for the independence model, suggesting better 
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a fit of DiP to the data. Furthermore, the corresponding mixed-membership estimates of {On} are more 
sharply peaked (as seem before in Figure 3). This is also confirmed by the estimates of Dirichlet hyper- 
parameter, aindep = 1.355 versus aoiP = 0.066. The themes (or clusters) shown in Figure 4 indeed lead 
to reasonable predictions of mouse retinal gene functions. For example, a preliminary biological validation 
of our clustering based on the GO annotation shows correlation between the latent themes and the functions 
for genes such as photoreceptors and rhodospin, i.e., genes with similar functional annotations tend to fall 
into the same theme in our analysis. An in-depth analysis of the biological significance of these clusters is 
given elsewhere. 




Figure 4: Gene expression themes learned from mouse retinal SAGE using conditional DiP. 
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6 Related Work 



Mixed membership analysis — attribute data 



Here we discuss the connection between our algorithm and the PoissonC and PoissonL algorithms intro- 
duced by Cai et al. (2004). In the problem at hand we want to allocate the observed temporal expression 
profiles {Y^-^}^^^ into, say, K themes or clusters. Recall that the K-means unsupervised clustering algo- 
rithm searches for K mean profiles mi-K that minimizes 

K N 



k=l n=l 



That is, the mean profiles mi^x are centers of respective clusters in the sense of Euclidean norm. The 
PoissonC and PoissonL algorithms introduced by Cai et al. (2004) substitute the euclidean norm in the 
equation with the chi-squared score. 



and the negative log-likelihood. 



T 



e 



i{n^k) = log J- 1 , (8) 



t=i 



respectively. Our normalized model based on the Poisson distribution is an extension of the PoissonL 
algorithm, where we introduce Dirichlet distributed mixed-membership vectors, 0^, not known in advance. 
In the PoissonL algorithm the mixed-membership vectors On are known, i.e., for the n-th gene we can write 



1 if A: = 2n 
otherwise, 



where jVi = argmin { L(n, fc) : G [1, }. This extension is similar in spirit to that introduced by 
Gaussian mixture to regular K-means. In fact, we have 

Qnk — P'^ ( cluster = k | data ^ parameter s ) . 
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Note that introducing latent Dirichlet distributed mixed-membership vectors, On, ties together all the data 
in the inference task. This has the beneficial effect of reducing the variability of profile specific parameters 
as we make use of all the gene counts (independently of which profile they express the most) in estimating 
each such parameters. Such an improvement in the estimates is expected James and Stein (1961). 

Our basic Poisson model is similar to that of Canny (2004). For a technical survey of related latent 
aspects models see Buntine and Jakulin (2006). 

7 Conclusions 

In problems where features co-occur frequently (e.g., a gene can be present on multiple transcript, as picked 
up by SAGE), computational gains are hardly warranted. Applications to problems that arise in computa- 
tional biology, e.g., SAGE and microarray data, are one such case. In this paper, we introduce probabilistic 
models to learn latent expression themes in an unsupervised fashion. Our models capture the notion of 
"contagion" to characterize semantic themes underlying observed feature patterns, such as "biological con- 
text", within a hierarchical Bayesian scheme. We present model variants tailored to different properties of 
biological data, and we outline a general variational inference scheme for approximate posterior inference. 

Our results suggest the possibility of obtaining reasonable predictions of gene functions in an unsuper- 
vised fashion. The estimates our models provide, in scenarios that feature realistic variability profiles for the 
data, are sharper than those entailed by existent methods based on stronger independence assumptions, and 
demonstrate feasibility of a promising hierarchical Bayesian formalism for soft clustering and latent aspect 
analysis. 
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